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Abstract 



Based on cumulative distribution functions, Fourier series expansion and Kolmogorov tests, we present a simple 
method to display probability densities for data drawn from a continuous distribution. It is often more efficient than 
using histograms. 
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1. Introduction 

We address the simple problem of displaying an empirical probability density (PD) f(x) from data for a continuous 
variable x. Commonly this is done using histograms. This is appropriate when x is discrete, because there is then a 
natural scale. But in case of a continuous variable x, one is faced with choosing binsizes. This is a frustrated problem: 
One would like to keep the binsize small for a high resolution, but big to suppress statistical fluctuations. Here we 
present a method 1 1 1 to by-pass the problem. It is based on the cumulative distribution function (CDF) 



Given a time series of n real numbers (data), a parameter free empirical estimate (ECDF), is well-known: The step 
function F(x) defined by increasing by 1/n at each data point. This does not help directly in getting an estimate of the 
probability density, because the derivative is a sum of Dirac delta functions. 

One needs some kind of interpolation of the CDF. This is no fun, as one has to decide whether the interpolation 
of 2, 3, 4, or k points will work best. In contrast, plotting a histogram is simple and robust, but not a smooth function. 
Our way out relies on Fourier expansion of the ECDF F(x). This leads to the desired smooth approximation as long 
as the expansion is sufficiently short, but will imitate every wiggle of the data, when carried too far. Therefore, one 
needs a cut-off criterion. We base this on the Kolmogorov test , which tells us whether the difference between the 
ECDF and an analytical approximation of the CDF is explained by chance. Fortran code for our procedure HI is 
available from the CPC Library. 

2. (Peaked) Cumulative Distribution Functions 

Assume we generate n random numbers xi, ■ ■ ■, x„. We re-arrange the jc,- in increasing order (tti, . . .,7t„ a permu- 
tation of 1, ... ,n): 




(1) 
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An estimator for the distribution function F(x) is the ECDF 



F(x) — - for x„. < X < , / = 0, 1 , . . . , n - 1 , n, 



(3) 



and by definition x„^ = -oo, x„^^^^ - +oo. Fig. [T| shows an ECDF from 100 Gaussian distributed random numbers 
generated for the probabihty density 



1 / X- 

expf 



V2^ I 2 

together with the exact CDF. The CDF is in this case determined by the error function: 



(4) 




Figure 1 : ECDF from 100 Gaussian distributed random numbers together with the exact CDF. 



G(x) = J' dx'g(x') = 1 + 1 erf j^j . (5) 

The probabihty density of events is encoded in the slope of the ECDF. This makes it often difficult to read off high 
probability regions and, in particular, the median. This can be improved by switching to the peaked CDF 

Fp(x) ^{F(x) for Fix) < ^ ; 1 - Fix) for Fix) > ^ . (6) 

By construction the maximum of the peaked CDF is at the median xi and Fpixi/2) - 1/2. Therefore, Fpix) has two 
advantages: The median is clearly exhibited and the accuracy of the ordinate is doubled. It looks a bit like a PD, but 
is in essence still the integrated PD. An example from 10,000 Gaussian random numbers is shown in Fig.|2] 

3. Kolmogorov Test 

Do empirical and exact CDFs of our two figures agree? The Kolmogorov test answers this question (for a review 
see PI). It returns the probability Q, that the difference between the analytical CDF and an ECDF from statistically 
independent data is due to chance. If the analytical CDF is known and the data are sampled from this distribution, Q 
is a uniformly distributed random variable in the range Q < Q < 1. Turned around, if one is not sure about the exact 
CDF, or the data, or both, and Q is small (say, Q < 10"^) one concludes that the difference between the proposed CDF 
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Figure 2: Peaked ECDF from the 10,000 Gaussian random numbers versus exact Gaussian peaked CDF. The arrows indicate 70% and 95% 
confidence intervals. 



and the data is presumably not due to chance. Kolmogorov's ingenious test rehes just on the maximum difference 
between the ECDF and the CDF: _ 

A = max\F(x)-F(x)\ . (7) 

The test yields, respectively, Q = 0.19 and Q - 0.78 for the samples used in Fig. [T] and [2] Both values signal 
consistency between CDF and data. 

4. Probability Densities 

Our method 1 1 1 to construct an empirical probability density (EPD) from an ECDF consists of two steps: 

1. Define as an initial approximation to F(x) a differentiable, monotonically increasing function Fq{x). 

2. Fourier expand the remainder until the Kolmogorov test yields Q > Qcut =1/2 (there may be some flexibiUty 
in lowering gcut)- 

For Fq{x) we require 

Fci(x) = for X < a and 1 for x > b , (8) 

where [a,b] has to lie within the range of the data. For PDs with support on a compact interval, or with fast fall- 
off like for a Gaussian distribution, the natural choice is a = x„j and b = Xj^^. In case of slow fall-off, like for a 
Cauchy distribution, or other distributions with outliers, one has to restrict the analysis to [a,b] regions, which are 
well populated by data. 

We denote the ECDF of the range [a,b] by Fatix). As for Fq{x), by construction Fahix) - for x < a and 1 for 
X > b. Our aim is to construct a PD estimator fahix) from Fahix). In the following we restrict our choice of Fo(x) to 
the straight line, 

Fo(x) = ^5—^ for a<x<b, (9) 
b — a 

which keeps the approach simple. More elaborate definitions will likely give improvements in a number of situations, 
but may discourage applications. Once Fq(x) is defined, the remainder of the ECDF is given by 



Rix) = Fahix) - Fo(x) . 



(10) 



B.A. Berg / Physics Procedia 00 (2013) 1-^ 



4 



We expand R(x) into the Fourier series 

R(-)-p(i)sin{^-^f^Y (11) 

The cosine terms are not present due to the boundary conditions R(a) - R(b) - 0. The Fourier coefficients follow 
from 




In our case R{x) is the difference of a step function and a linear function. The integrals over the flat regions of the step 
function are easily calculated, and the d{i) obtained by adding them up. 

The Fourier expansion is useless for too large values of m, because it will then reproduce all statistical fluctuations 
of the data. To get around this problem, we perform the Kolmogorov test first between Fab(x) and f o(x) (m = 0), and 
then each time m is incremented from m — » m + 1. Once Q > Qcut - 1/2 is reached, we know that the information 
left in the data is statistical noise and the expansion is terminated. The thus obtained smooth estimate of the CDF, 

^^estimate(jc) = Fo{x) + R(x) , (13) 

yields fab(^) by differentiation. 

We attach error bars to the estimate of the PD by dividing the (unsorted) original data into jackknife blocks 
and repeat the analysis for each block. Comparing the thus obtained function values, error bars follow in the usual 
jackknife way. An example for the Gaussian distribution follows. See | IJ for more examples: The Cauchy distribution 
and autocorrelated data from U(l) lattice gauge theory. The histogram for the Gaussian distribution is shown in Fig.|3] 



0.5 
0.4 
0.3 
0.2 
0.1 




Histogram 



-2 -1.5 -1 -0.5 0.5 1 1.5 2 

X 



Figure 3: Histogram of 5 1 bins for 2 000 random numbers generated according to the Gaussian distribution. 



(the error bars follow from the variance p{l-p)of the bimodal distribution with p - h{i)/n). Fig.|4]gives our estimate 
g(x) of the PD obtained from the same data with the described method. We used a - x„, and b - x„^^. Q - 0.97 was 
reached with m-A{Q- 0.056 with m - 3). Twenty jackknife blocks were used to calculate the error bars. 

5. Summary and Conclusions 

Based on Fourier expansion and Kolmogorov tests, we introduced a method for constructing continuous proba- 
bility density functions from data. We did not develop a statistically rigorous approach. We address physicists and 
others, who do not hesitate to use whatever works. 



B.A. Berg / Physics Pmcedia 00 (2013) 1-^ 



5 




^ ' ' ' ' ' ' ' J 

-2 -1.5 -1 -0.5 0.5 1 1.5 2 



X 

Figure 4: Estimate g(x) for the data of tlie previous figure. 



Our results were obtained with a straight line as initial approximation for the CDF. There is certainly space for 
improvement at the price of giving up some of the simplicity. With our Qcut = 1/2 rule, we are slightly overexpanding 
the Fourier expansion. In the average Q should be 1/2, but all our values are Q > 1/2. That gives some flexibility to 
lower 2cut when the m of the Fourier expansion appears to be too large. 

There are many open questions. Given the initial approximation, we construct a smooth Fourier expansion of the 
remainder, that is consistent with the data, using the ordering in which the long wave lengths modes come first. Obvi- 
ously, the result of this procedure is not the only analytical function, which is consistent with the data. Which ordering 
of the serious expansion or other complete function system gives the smoothest approximation (smallest number of 
terms) consistent with the data? Do systems of monotonically increasing functions exist, which are complete for the 
expansion of monotonically increasing functions? 

Kernel density estimates |l3] [H are in spirit similar (but by no means identical) to our method. A comparison 
remains to be carried out. 
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